Fast vector quantization with 2-4 bit compression and SIMD search

turbovec — Google's TurboQuant for vector search

A 10 million document corpus takes 31 GB of RAM as float32. turbovec fits it in 4 GB - and searches it faster than FAISS.

turbovec is a Rust vector index with Python bindings, built on Google Research's TurboQuant algorithm — a data-oblivious quantizer with near-optimal distortion and no separate training phase.

  • Online ingest. Add vectors, they're indexed — no train step, no parameter tuning, no rebuilds as the corpus grows.
  • Fast SIMD search. Hand-written NEON (ARM) and AVX-512BW (x86) kernels. On ARM they beat FAISS IndexPQFastScan by 10–19%; on x86 they win the 4-bit configs and trail by a few percent on 2-bit.
  • Filter at search time. Pass an id allowlist (or a slot bitmask) to search() and the kernel honours it directly. You always get up to k results from the allowed set — no over-fetching, no recall hit on selective filters.
  • Pure local. No managed service, no data leaving your machine or VPC. Pair with any open-source embedding model for a fully air-gapped RAG stack.

Building RAG where privacy, memory, or latency matters? You're in the right place.

Python

pip install turbovec

from turbovec import TurboQuantIndex

# 1536-dimensional vectors, 4 bits per coordinate.
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors)        # indexed on add; no train step
index.add(more_vectors)   # later batches go into the same index

scores, indices = index.search(query, k=10)

index.write("my_index.tv")
loaded = TurboQuantIndex.load("my_index.tv")

Need stable ids that survive deletes? Use IdMapIndex:

import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
index.add_with_ids(vectors, np.array([1001, 1002, 1003], dtype=np.uint64))

scores, ids = index.search(query, k=10)   # ids are your uint64 external ids
index.remove(1002)                         # O(1) by id

index.write("my_index.tvim")
loaded = IdMapIndex.load("my_index.tvim")
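
One common way to get stable external ids with O(1) removal is an id-to-slot hash map combined with swap-remove on the slot array. The sketch below illustrates that general idea only: `IdMapSketch` and its fields are hypothetical names, and turbovec's actual internals may differ.

```python
class IdMapSketch:
    """Toy id map: external id <-> dense slot, with O(1) swap-remove."""

    def __init__(self):
        self.slot_of = {}   # external id -> slot
        self.id_at = []     # slot -> external id

    def add(self, ext_id):
        if ext_id in self.slot_of:
            raise ValueError(f"duplicate id {ext_id}")
        self.slot_of[ext_id] = len(self.id_at)
        self.id_at.append(ext_id)

    def remove(self, ext_id):
        slot = self.slot_of.pop(ext_id)
        last_id = self.id_at.pop()
        if last_id != ext_id:
            # Move the last entry into the freed slot so slots stay dense.
            self.id_at[slot] = last_id
            self.slot_of[last_id] = slot


m = IdMapSketch()
for ext_id in (1001, 1002, 1003):
    m.add(ext_id)
m.remove(1002)   # 1003 moves into the slot that 1002 occupied
```

Slots can move on delete in a layout like this, which is why callers hold on to external ids and never to slot positions.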

Hybrid retrieval (filtered search)

Restrict results to a candidate set produced by another system (SQL, BM25, ACL, time window, …):

import numpy as np
from turbovec import IdMapIndex

idx = IdMapIndex(dim=1536, bit_width=4)
idx.add_with_ids(vectors, ids)

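# Stage 1: external system narrows to candidate ids.
# `db` is a DB connection (e.g. sqlite3) and `t` the tenant key, both defined elsewhere.
rows = db.execute("SELECT id FROM docs WHERE tenant=?", (t,)).fetchall()
# fetchall() returns 1-tuples; unpack them so the allowlist is a flat 1-D array of ids.
allowed = np.array([row[0] for row in rows], dtype=np.uint64)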

# Stage 2: dense rerank within the candidate set.
scores, ids = idx.search(query, k=10, allowlist=allowed)

Filtering happens inside the SIMD kernel at 32-vector block granularity: blocks with no allowed slots are short-circuited before any LUT lookup or scoring work, and individual non-allowed slots inside scored blocks are dropped at heap-insert. Selective allowlists (small fraction of the index allowed) therefore avoid most of the SIMD cost rather than paying it and discarding the result afterwards.

The output length is min(k, len(allowed)) — when the allowlist is smaller than k you get exactly len(allowed) results rather than padded fallbacks.
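
The block-level short-circuit and the min(k, len(allowed)) output length can be illustrated with a plain-Python sketch. This is not the SIMD kernel: `filtered_top_k` is a hypothetical helper, scores are precomputed here for simplicity (the real kernel would compute them only for blocks it does not skip), and only the block size of 32 is taken from the text.

```python
import heapq

BLOCK = 32  # block granularity described above


def filtered_top_k(scores, allowed_slots, k):
    """Top-k over allowed slots, skipping blocks that contain none."""
    n = len(scores)
    mask = [False] * n
    for s in allowed_slots:
        mask[s] = True

    heap = []            # min-heap of (score, slot), at most k entries
    blocks_scored = 0
    for start in range(0, n, BLOCK):
        end = min(start + BLOCK, n)
        if not any(mask[start:end]):
            continue     # short-circuit: no scoring work for this block
        blocks_scored += 1
        for slot in range(start, end):
            if not mask[slot]:
                continue  # non-allowed slot dropped at heap insert
            item = (scores[slot], slot)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True), blocks_scored


scores = [float(i % 7) for i in range(128)]   # stand-in for kernel scores
best, blocks_scored = filtered_top_k(scores, allowed_slots=[3, 5, 100], k=10)
```

With three allowed slots and k=10, `best` holds exactly three entries, and only the two blocks that contain an allowed slot are visited.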

See docs/api.md for the full reference.

Framework integrations

Drop-in replacements for the in-tree reference vector / document stores in each framework. Same public surface, same persistence semantics, same retriever and pipeline wiring — swap the import and keep your pipeline.

  • LangChain: pip install turbovec[langchain] · replaces langchain_core.vectorstores.InMemoryVectorStore
  • LlamaIndex: pip install turbovec[llama-index] · replaces llama_index.core.vector_stores.SimpleVectorStore
  • Haystack: pip install turbovec[haystack] · replaces haystack.document_stores.in_memory.InMemoryDocumentStore
  • Agno: pip install turbovec[agno] · replaces agno.vectordb.lancedb.LanceDb

Rust

cargo add turbovec

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4).unwrap();
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();
let loaded = TurboQuantIndex::load("index.tv").unwrap();

For stable external ids that survive deletes:

use turbovec::IdMapIndex;

let mut index = IdMapIndex::new(1536, 4).unwrap();
index.add_with_ids(&vectors, &[1001, 1002, 1003]).unwrap();
let (scores, ids) = index.search(&queries, 10);
index.remove(1002);
index.write("index.tvim").unwrap();
let loaded = IdMapIndex::load("index.tvim").unwrap();

Recall

TurboQuant vs FAISS IndexPQ (LUT256, nbits=8) — the paper's Section 4.4 baseline. 100K vectors, k=64. FAISS PQ sub-quantizer counts sized to match TurboQuant's bit rate (m=d/4 at 2-bit, m=d/2 at 4-bit).

[Figures: recall for GloVe d=200, OpenAI d=1536, and OpenAI d=3072]

Across OpenAI d=1536 and d=3072, TurboQuant beats FAISS by 0.2–1.9 points at R@1 at both 2-bit and 4-bit, and both methods reach 1.0 by k=8 (≥0.997 already at k=4). GloVe d=200 is the harder regime: at low dimension the asymptotic Beta assumption is looser. There, TurboQuant beats FAISS by 0.9 points at 4-bit and is effectively tied at 2-bit (within 0.1 points) at R@1, and both bit widths track FAISS closely by k≈16.
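
For readers unfamiliar with the metric, the sketch below assumes that "R@1 at k" means the fraction of queries whose exact nearest neighbour appears among the first k results returned. That reading, and the helper name `recall_1_at_k`, are assumptions; the benchmark scripts define the authoritative version.

```python
def recall_1_at_k(true_top1, retrieved, k):
    """Fraction of queries whose exact nearest neighbour is in the top-k."""
    hits = sum(1 for t, r in zip(true_top1, retrieved) if t in r[:k])
    return hits / len(true_top1)


# Toy data: 4 queries, ground-truth nearest neighbour and 3 returned ids each.
true_top1 = [7, 2, 9, 4]
retrieved = [[7, 1, 3], [5, 2, 8], [0, 6, 9], [1, 3, 5]]

r_at_1 = recall_1_at_k(true_top1, retrieved, 1)
r_at_2 = recall_1_at_k(true_top1, retrieved, 2)
r_at_3 = recall_1_at_k(true_top1, retrieved, 3)
```

Under this definition recall can only grow with k, which matches curves that climb to 1.0 as k increases.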

A note on baselines. We compare against FAISS IndexPQ (LUT256, nbits=8, float32 LUT) because it's the default production-grade PQ most users would reach for. This is a stronger baseline than the custom u8-LUT PQ in the TurboQuant paper — FAISS uses a higher-precision LUT at scoring time and k-means++ for codebook training. We reproduce the paper's TurboQuant numbers on OpenAI d=1536 / d=3072 and hit similar numbers to other community reference implementations on low-dim embeddings (see turboquant-py at d=384). On GloVe (d=200) — the low-dim regime where the asymptotic Beta assumption is loosest — TurboQuant lands level with FAISS at 2-bit and ahead at 4-bit; TQ+ calibration closes the low-dim gap the base algorithm leaves.

Full results: d=1536 2-bit, d=1536 4-bit, d=3072 2-bit, d=3072 4-bit, GloVe 2-bit, GloVe 4-bit.

Compression

[Figure: compression]

Search Speed

All benchmarks: 100K vectors, 1K queries, k=64, median of 5 runs.

ARM (Apple M3 Max)

[Figures: ARM speed, single-threaded and multi-threaded]

On ARM, TurboQuant beats FAISS FastScan by 10–19% across every config.

x86 (Intel Xeon Platinum 8481C / Sapphire Rapids, 8 vCPUs)

[Figures: x86 speed, single-threaded and multi-threaded]

On x86, TurboQuant wins the 4-bit configs by up to ~5% (d=3072 multi-threaded is a tie) and is modestly behind FAISS on 2-bit: about 8% at d=1536 single-threaded, and within a few percent on the rest. On 2-bit, FAISS's AVX-512 VBMI path has the edge on the short accumulate loop.

How it works

Each vector is a direction on a high-dimensional hypersphere. TurboQuant compresses these directions using a simple insight: after applying a random rotation, every coordinate follows a known distribution -- regardless of the input data.

1. Normalize. Strip the length (norm) from each vector and store it as a single float. Now every vector is a unit direction on the hypersphere.

2. Random rotation. Multiply all vectors by the same random orthogonal matrix. After rotation, each coordinate follows a Beta distribution that converges to Gaussian N(0, 1/d) in high dimensions, and distinct coordinates are nearly (though not exactly) independent. This holds for any input data -- the rotation makes the coordinate distribution predictable.

3. Per-coordinate calibration (TQ+). The Beta distribution from step 2 is asymptotic: at finite dimensions, individual coordinates drift from the canonical shape (especially at low bit widths and for word-vector-style embeddings). TQ+ fits two scalars per coordinate, a shift and a scale, during the first add, mapping each coordinate's empirical 5%/95% quantiles onto the canonical Beta marginal. The Lloyd-Max codebook then quantizes against the target distribution it was designed for. The calibration is frozen after the first add and reused by subsequent adds, so there is no retraining, no rebuild, and no separate train phase. Recall gain: up to +1.4 points at R@1 on the cells that drift most (e.g. GloVe at 2-bit).

4. Lloyd-Max scalar quantization. Since the distribution is known, we can precompute the optimal way to bucket each coordinate. For 2-bit, that's 4 buckets; for 4-bit, 16 buckets. The Lloyd-Max algorithm finds bucket boundaries and centroids that minimize mean squared error. These are computed once from the math, not from the data.

5. Bit-pack. Each coordinate is now a small integer (0-3 for 2-bit, 0-15 for 4-bit). Pack these tightly into bytes. A 1536-dim vector goes from 6,144 bytes (FP32) to 384 bytes (2-bit). That's 16x compression.

6. Length-renormalized scoring. Scalar quantization systematically underestimates inner products — the reconstructed unit direction is a little shorter than the original. We compute one scalar per vector at encode time — the inner product of the rotated unit vector with its own centroid reconstruction — and store ||v|| / ⟨u, x̂⟩ alongside each compressed vector. The search kernel multiplies the per-candidate score by this scalar before heap insertion, turning the inner-product estimator from downward-biased into unbiased at zero search-time cost and zero extra storage. The recall gain shows up most at low bit widths, where the quantization shrinkage is largest.

Encoding cost: one extra d-dimensional dot product per vector to compute ⟨u, x̂⟩. On 1M vectors at d=1536 this adds less than a second of encode time, a one-time price paid at ingest, not at query.
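
The encode steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not turbovec's implementation: it uses the 2-bit Lloyd-Max levels for a Gaussian as a stand-in for the Beta-based codebook, skips TQ+ calibration (step 3), builds the rotation with Gram-Schmidt, and requires d to be a multiple of 4. All function names are hypothetical.

```python
import math
import random

# 2-bit Lloyd-Max levels and cell boundaries for a unit-variance Gaussian.
# Assumption: a Gaussian stand-in for the Beta-based codebook in the text.
LEVELS = [-1.510, -0.4528, 0.4528, 1.510]
BOUNDS = [-0.9816, 0.0, 0.9816]


def random_rotation(d, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        n = math.sqrt(sum(a * a for a in v))
        rows.append([a / n for a in v])
    return rows


def rotate(rot, v):
    return [sum(r * a for r, a in zip(row, v)) for row in rot]


def encode(v, rot):
    """Steps 1, 2, 4, 5 and 6 for one vector."""
    d = len(v)
    sigma = 1.0 / math.sqrt(d)                      # coordinate std after rotation
    norm = math.sqrt(sum(a * a for a in v))         # step 1: strip the length
    u = [x / norm for x in rotate(rot, v)]          # step 2: rotated unit vector
    codes = [sum(x > b * sigma for b in BOUNDS) for x in u]   # step 4: bucket 0..3
    packed = bytes(                                 # step 5: four codes per byte
        codes[i] | (codes[i + 1] << 2) | (codes[i + 2] << 4) | (codes[i + 3] << 6)
        for i in range(0, d, 4)
    )
    x_hat = [LEVELS[c] * sigma for c in codes]      # centroid reconstruction
    scale = norm / sum(a * b for a, b in zip(u, x_hat))       # step 6: ||v|| / <u, x_hat>
    return packed, scale


def estimate_dot(query, packed, scale, rot):
    """Renormalized inner-product estimate between query and the encoded vector."""
    sigma = 1.0 / math.sqrt(len(query))
    codes = [(byte >> s) & 3 for byte in packed for s in (0, 2, 4, 6)]
    raw = sum(q * LEVELS[c] * sigma for q, c in zip(rotate(rot, query), codes))
    return scale * raw


rng = random.Random(0)
d = 64
rot = random_rotation(d, rng)
v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
packed, scale = encode(v, rot)

self_score = estimate_dot(v, packed, scale, rot)   # query == stored vector
exact = sum(a * a for a in v)                      # true <v, v>
```

Here `packed` is d/4 bytes against 4d bytes for float32, the same 16x ratio as in step 5. Scoring a vector against itself shows what the stored scalar buys: `self_score` equals `exact` up to floating-point rounding, whereas the unscaled sum would be off by the factor ⟨u, x̂⟩.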

Search. Instead of decompressing every database vector, we rotate the query once into the same domain and score directly against the codebook values. The scoring kernel uses SIMD intrinsics (NEON on ARM, AVX-512BW on modern x86 with an AVX2 fallback) with nibble-split lookup tables for maximum throughput.
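
A scalar sketch of table-based scoring for the 2-bit case follows. It assumes one particular reading of "nibble-split": each packed byte is split into two 4-bit halves, each half holds two 2-bit codes, and each half indexes its own 16-entry table built from the rotated query. The layout, the Gaussian stand-in levels, and the function names are assumptions for illustration; the real kernels do this with SIMD shuffles, and the per-vector scalar and the 1/sqrt(d) scaling are left out here.

```python
# 2-bit Lloyd-Max levels for a unit-variance Gaussian (stand-in codebook).
LEVELS = [-1.510, -0.4528, 0.4528, 1.510]


def pack(codes):
    """Four 2-bit codes per byte, lowest coordinate in the lowest bits."""
    return bytes(
        codes[i] | (codes[i + 1] << 2) | (codes[i + 2] << 4) | (codes[i + 3] << 6)
        for i in range(0, len(codes), 4)
    )


def build_nibble_luts(q_rot):
    """One 16-entry table per nibble; a nibble covers two coordinates."""
    luts = []
    for g in range(0, len(q_rot), 2):
        luts.append([
            q_rot[g] * LEVELS[nib & 3] + q_rot[g + 1] * LEVELS[nib >> 2]
            for nib in range(16)
        ])
    return luts


def score_packed(packed, luts):
    """Sum two table lookups per byte; the vector is never decompressed."""
    total = 0.0
    for i, byte in enumerate(packed):
        total += luts[2 * i][byte & 0x0F] + luts[2 * i + 1][byte >> 4]
    return total


codes = [0, 1, 2, 3, 3, 2, 1, 0]
q_rot = [0.5, -0.25, 1.0, 0.0, -1.0, 0.75, 0.25, -0.5]   # already-rotated query

luts = build_nibble_luts(q_rot)          # built once per query
lut_score = score_packed(pack(codes), luts)

# Reference: decode every coordinate and take the dot product directly.
direct = sum(q * LEVELS[c] for q, c in zip(q_rot, codes))
```

The tables depend only on the query, so their cost is amortized over every database vector scored against it.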

The Lloyd-Max codebook achieves distortion within a factor of 2.7x of the information-theoretic lower bound (Shannon's distortion-rate limit); the length-renormalization step removes the residual bias the Lloyd-Max codebook introduces on the inner-product estimator itself.
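
The size of that gap can be checked numerically for the Gaussian limit of the coordinate distribution. The sketch below runs Lloyd-Max iterations on a unit-variance Gaussian (an assumption standing in for the exact Beta marginal) and compares the resulting mean squared error with the Gaussian distortion-rate bound D(b) = 2^(-2b). The function names are hypothetical.

```python
import math


def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(bits, iters=2000):
    """Lloyd-Max levels and MSE for a unit-variance Gaussian."""
    n = 2 ** bits
    levels = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]   # initial guess
    for _ in range(iters):
        # Boundaries sit halfway between neighbouring levels.
        mids = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Each level becomes the conditional mean of its cell.
        probs = [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]
        levels = [(pdf(a) - pdf(b)) / p
                  for a, b, p in zip(edges, edges[1:], probs)]
    # With conditional-mean levels, MSE = E[x^2] - E[Q(x)^2].
    mse = 1.0 - sum(p * c * c for p, c in zip(probs, levels))
    return levels, mse


levels2, mse2 = lloyd_max_gaussian(2)
levels4, mse4 = lloyd_max_gaussian(4)

ratio2 = mse2 / 2.0 ** (-2 * 2)   # distortion relative to the Shannon bound
ratio4 = mse4 / 2.0 ** (-2 * 4)
high_rate_limit = math.pi * math.sqrt(3.0) / 2.0   # about 2.72
```

For this Gaussian stand-in the ratio comes out at roughly 1.9 at 2 bits and 2.4 at 4 bits, approaching π√3/2 ≈ 2.72 as the bit width grows, which is consistent with the 2.7x figure quoted above.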

Building

Python (via maturin)

pip install maturin
cd turbovec-python
maturin build --release
pip install target/wheels/*.whl

Rust

cargo build --release

All x86_64 builds target x86-64-v3 (AVX2 baseline, Haswell 2013+) via .cargo/config.toml. Any CPU that can run the AVX2 fallback kernel can run the whole crate — the AVX-512 kernel is gated at runtime via is_x86_feature_detected! and only kicks in on hardware that supports it.

Running benchmarks

Download datasets:

python3 benchmarks/download_data.py all            # all datasets
python3 benchmarks/download_data.py glove          # GloVe d=200
python3 benchmarks/download_data.py openai-1536    # OpenAI DBpedia d=1536
python3 benchmarks/download_data.py openai-3072    # OpenAI DBpedia d=3072

Each benchmark is a self-contained script in benchmarks/suite/. Run any one individually:

python3 benchmarks/suite/speed_d1536_2bit_arm_mt.py
python3 benchmarks/suite/recall_d1536_2bit.py
python3 benchmarks/suite/compression.py

Run all benchmarks for a category:

for f in benchmarks/suite/speed_*arm*.py; do python3 "$f"; done    # all ARM speed
for f in benchmarks/suite/speed_*x86*.py; do python3 "$f"; done    # all x86 speed
for f in benchmarks/suite/recall_*.py; do python3 "$f"; done       # all recall
python3 benchmarks/suite/compression.py                            # compression

Results are saved as JSON to benchmarks/results/. Regenerate charts:

python3 benchmarks/create_diagrams.py
